1 Introduction


This  is  the  final  report  on  our  work  under  ONR  Contract  N00014-86-K-0726,  August  1,  1986 
through  July  31,  1989. 

The  major  results  are  in  two  areas: 

1.  Studies  of  systematic  design  procedures  for  a  class  of  structured  algorithms  often  encountered 
in  signal  processing  applications.  These  are  what  we  have  called  Regular  Iterative  Algorithms 
(RIAs)  for  which  our  results  are  summarized  in  Section  2. 

It  might  be  mentioned  that  these  ideas  have  been  successfully  used  by  one  of  our  former 
students  who  helped  to  develop  this  theory,  Dr.  S.  K.  Rao  of  AT&T  Bell  Laboratories  in 
Holmdel,  N.J.  Dr.  Rao  has  found  the  RIA  results  helpful  in  designing  several  fast  integrated 
circuit  chips  for  communications  and  signal  processing  applications,  some  of  which  are  being 
used in the AT&T-Zenith joint effort on High Definition Television (HDTV).

2.  The  other  major  area  of  effort  was  the  study  of  a  notable  family  of  algorithms  that  are  not  in 
the  RIA  form,  viz.,  those  associated  with  Viterbi  decoding  of  convolutional  and  trellis  codes 
or  more  generally  with  shortest-path  problems  in  graphs. 

This work, which is described in Section 3, is also being pursued by Dr. P. G. Gulak, a
postdoctoral scholar and research associate on the contract, who is now teaching at the University
of Toronto, Canada. Dr. Gulak is having special chips designed and built by Bell Northern
Research.

2  Summary  of  Our  Work  on  Regular  Iterative  Algorithms 

Our previous work has shown (see, e.g., [14, 15, 32, 33, 34, 35, 41]) that once a Regular
Iterative Algorithm is designed for a given problem, one can use the systematic design theory
developed by us to generate efficient processor arrays. However, most algorithms are not given to
the designer in the RIA form; most initial representations are either sequential in nature (e.g.,
FORTRAN or PASCAL programs) or general mathematical expressions. We have thus developed a
formal  methodology  for  systematic  formulation  of  RIAs  starting  from  representations  that  we  refer 
to  as  linearly  indexed  Assignment  Codes.  It  can  be  shown  that  such  codes  are  very  close  to  the 
mathematical  expressions  of  a  wide  variety  of  problems,  especially  in  signal  processing  and  matrix 
algebra. 

In  this  section,  we  shall  first  briefly  introduce  RIAs  and  summarize  our  contributions  in  the 
analysis  and  implementation  of  such  algorithms.  We  shall  then  briefly  summarize  our  formal 
methodology for deriving RIAs starting from general representations such as mathematical formulas.

2.1 Regular Iterative Algorithms and Our Contributions

A  formal  definition  of  RIAs  can  be  found  in  [17,  35,  41];  here  we  shall  introduce  RIAs  via  a  simple 
example. 

Example (2-D Filtering Algorithm): It can be shown (see [32, 35]) that certain numerically
stable 2-D filtering algorithms due to Deprettere and Dewilde [5], Vaidyanathan and Mitra [47],
and Fettweis [6] can all be written in the form:

For all $(i, j, k)$ where $0 < i < n$ and $0 < j, k < N$, do

$$
\begin{aligned}
x(i, j+1, k+1) &= f_{x,i}\bigl(x(i,j,k),\, y(i,j,k),\, w(i,j,k)\bigr) \\
y(i+1, j, k) &= f_{y,i}\bigl(x(i,j,k),\, y(i,j,k),\, w(i,j,k)\bigr) \\
w(i-1, j, k) &= f_{w,i}\bigl(x(i,j,k),\, w(i,j,k)\bigr)
\end{aligned}
$$

where $f_{x,i}$, $f_{y,i}$, $f_{w,i}$ are linear functions that are determined by a synthesis procedure.

□ 

The  example  displays  the  following  (characteristic)  features  of  an  RIA: 

Each variable in the RIA is identified by a label (e.g., x, y or w in the example above) and an
index vector (e.g., $I = [i\ j\ k]^T$ in the example above). The set of all index points over which
the variables of the RIA are defined is called the index space, which is a subset of the
S-dimensional integer lattice, $\mathbb{Z}^S$.

The dependences among the variables are regular with respect to the index points. That is,
if $x_1(I)$ is computed using the value of $x_2(I - d_{12})$, then the index displacement vector
$d_{12}$ corresponding to this direct dependence is the same regardless of the index point $I$.

The  set  of  computations  performed  at  every  index  point  is  often  referred  to  as  the  iteration  unit 
of  the  RIA.  Also,  note  that  although  the  direct  dependences  among  the  variables  in  an  RIA  are 
required  to  be  independent  of  the  index  points,  the  actual  computations  carried  out  to  evaluate 
these  variables  can  depend  on  the  index  point.  In  general,  the  index  space  I  will  be  semi-infinite 
along  certain  coordinates  and  bounded  along  others.  The  bounds  on  the  coordinates  will  be  referred 
to  as  the  size  parameters  of  the  RIA. 

The  regular  dependences  of  an  RIA  lead  to  a  dependence  graph  with  an  iterative  structure, 
which  can  be  clearly  demonstrated  by  embedding  the  dependence  graph  within  the  index  space. 
That is, a set of V nodes is defined at every index point $I$ in the index space $\mathbf{I}$, where
the i-th node represents the variable $x_i(I)$ in the RIA. As first noted by Karp et al. [17] and by
Waite [50], the regularity of the dependence graph of an RIA can be concisely expressed in terms of a
simpler and smaller graph called the Reduced Dependence Graph (RDG). The RDG of an RIA (see Fig. 1)
has one node for each of the indexed variables in the RIA; it has a directed arc from node $x_i$ to
node $x_j$ if $x_j(I)$ is computed using the value of $x_i(I - d_{ji})$ for some $d_{ji}$; finally,
each directed arc is assigned a vector weight representing the displacement of the index point across
the direct dependence. We should note that the RDG and a specification of the index space
$\mathbf{I}$ completely characterize the dependence graph of an RIA; hence, the analysis of
parallelism in an RIA is based on the analysis of the RDG instead of the larger dependence graph.

Figure 1: The RDG of the RIA in the above example.
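As a minimal sketch of how the RDG can be read off the statements of an RIA, the following Python fragment encodes the 2-D filtering example and lists its arcs. The dictionary layout and the function name `reduced_dependence_graph` are illustrative assumptions and do not come from the cited references.

```python
# Hypothetical encoding of the 2-D filtering example: each statement
# "lhs(I + offset) = f(sources evaluated at I)" is stored as
# lhs -> (offset, sources).
STATEMENTS = {
    "x": ((0, 1, 1), ("x", "y", "w")),
    "y": ((1, 0, 0), ("x", "y", "w")),
    "w": ((-1, 0, 0), ("x", "w")),
}


def reduced_dependence_graph(statements):
    """Return the RDG as a list of (source, target, displacement) arcs.

    Writing lhs(I + offset) = f(src(I)) is the same as saying that lhs(I)
    uses src(I - offset), so the arc weight is simply the offset.
    """
    arcs = []
    for target, (offset, sources) in statements.items():
        for source in sources:
            arcs.append((source, target, offset))
    return arcs


rdg = reduced_dependence_graph(STATEMENTS)
for source, target, displacement in rdg:
    print(f"{source} -> {target}  weight {displacement}")
```

The graph has three nodes and eight arcs regardless of the size parameters n and N, which is why the analysis can work on the RDG rather than on the full dependence graph.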

Some of our important results are enumerated below; for a detailed account of the work reported
here, please see [32, 41].

1. A formal definition of systolic arrays was obtained that captured their generally accepted
properties, especially regularity (mostly identical processors), spatial locality (local
interconnections), temporal locality (no delay-free operations, or more precisely, all combinational
elements are latched) and pipelined operation (throughput independent of the order, suitably
defined, of the system). Some authors (e.g., Leiserson et al. [24]) had used only a subset
of these properties, but the consensus in the literature appeared to require all those
mentioned above (see, e.g., [34] and [21]).

2. A reasonable generalization of the concept of systolic arrays that allowed implementation of a
larger class of algorithms (including, of course, all systolic algorithms) was also developed. The
generalization allowed the presence of register pipelines of various lengths at different points in
a regular array of (mostly) identical processors, and sometimes also some LIFO (Last-In-First-Out)
buffers. Such architectures have almost all the advantages that make systolic arrays so
appealing for VLSI; the only added requirement is that some of the processors may require a
certain amount of memory. We should note here that the memory requirement is
not a major bottleneck, and certain commercial products, such as the WARP developed at
CMU, routinely provide such on-processor memory.

Rao et al. called such arrays Regular Iterative Arrays, and algorithms implementable on such
arrays were dubbed Regular Iterative Algorithms. It is convenient to use the acronym RIA
to stand for either of these concepts, the exact one to be inferred from the context. Using
the above concepts and their consequences, one can show, for example, that there are Regular
Iterative Algorithms (e.g., RIAs for certain classes of 2-D filtering algorithms, RIAs for
certain pivoting algorithms [41, 36, 37], etc.) that cannot be implemented on systolic arrays,
as formally defined, but can be implemented on regular iterative arrays.

3. It was also shown [15, 21, 32, 41] that many algorithms in digital filtering (convolution,
correlation, autoregressive, and moving-average filtering), numerical linear algebra, discrete
methods for PDEs and ODEs, and graph theory (transitive closure, some coloring problems)
can be reformulated as RIAs. Systematic procedures for converting algorithms into RIAs,
however, remained an open problem.

4. For any RIA, formal methods to determine lower bounds on I/O latency and memory requirements
were developed; systematic procedures for implementing RIAs on regular processor arrays
that can achieve the lower bound on I/O latency were also proposed (see [32, 33, 35, 41]).
We should mention here that these formal mapping techniques can generate all possible
architectures, though in practice one stops once a few efficient (i.e., meeting certain performance
lower bounds) arrays have been obtained.

5. In the design of systolic arrays, several issues, such as systematic procedures for designing
multi-rate systolic arrays, were resolved. In conventional systolic array designs all operations
were assumed to take the same amount of time; this led to unrealistic and inefficient
designs. Our design procedure allows one to carry out the design with more realistic processor
modules that can increase the throughput by exploiting the fact that the time required to
carry out different operations is generally different.

2.2  Systematic  Formulation  of  RIAs 

Let us first introduce the concept of localized algorithms, which are close to RIAs (see, e.g.,
[41, 40, 39, 16, 42]). The definition of localized algorithms is motivated by the observation that
certain problems can be solved by algorithms whose dependence graphs are regular but not
completely homogeneous. That is, the dependence graphs may have dependences or computations
that are present only in certain portions of the dependence graphs. As discussed in [41], one
way of handling such cases is to assume that the dependences and the computations are present
everywhere in the index space and then to apply the results for RIAs. There are several problems
where this approach is reasonable; for example, the Gaussian elimination algorithm without pivoting
can first be written in the localized algorithm form and then be implemented on processor arrays
by modeling the localized algorithm as an RIA. The other approach is to break up the dependence
graph into more than one component such that the dependence graph is homogeneous over each
component. The mapping techniques can then be applied to each such component, with special
consideration given to the dependences at the boundaries between the components. The latter approach
is discussed in more detail in [41], where the example of the Gauss-Jordan elimination algorithm is
worked out.

The  localized  algorithms  have  statements  of  the  form 

$$x_i(I) = f_i\bigl(x_1(I - d_{i1}), \ldots, x_V(I - d_{iV})\bigr) \quad \forall\, I \in \mathbf{I}_i. \qquad (1)$$

Thus  each  statement  in  this  algorithm  may  have  a  different  index  space  of  its  own;  as  a  comparison, 
all  statements  in  an  RIA  have  the  same  index  space. 

Partial  attempts  have  been  made  by  several  authors,  including  [18],  [25],  [20],  [27],  [4]  and  [10],  to 
formalize  the  conversion  procedure  for  going  from  an  initial  representation  to  an  RIA  or  a  localized 
algorithm.  The  first  step  always  is  to  convert  algorithms  into  equivalent  Single  Assignment  Codes 
(SACs)  and  the  second  step  tries  to  localize  the  dependences  by  eliminating  broadcasts.  Single 
assignment  codes  [2]  are  representations  where  every  variable  defined  in  the  algorithm  takes  on  a 
unique value during the course of execution. The fact that the dependence graph of an algorithm can
be easily determined from its SAC has made SACs a very useful starting representation for parallel
implementations of algorithms. Given this importance, much work has been done on
converting sequential algorithms into SACs; see, e.g., [26]. However, sequential algorithms are not the
only  representations  from  which  SACs  can  be  derived.  Often  SACs  can  be  derived  systematically 
from  given  mathematical  expressions.  Consider  a  mathematical  expression  for  matrix  multiplication 

For all tuples $(i, j)$, $1 \le i, j \le n$ do

$$c_{ij} = \sum_{1 \le k \le n} a_{ik} \cdot b_{kj} \qquad (2)$$

and a SAC

For all triples $(i, j, k)$, $1 \le i, j, k \le n$ do

$$c(i, j, k+1) := c(i, j, k) + a_{ik} \cdot b_{kj} \qquad (3)$$

In  the  mathematical  expression,  the  ordering  of  operations  in  the  inner  product  is  not  specified 
and  in  fact  it  can  be  arbitrary  because  of  the  commutativity  and  associativity  of  the  operation  +. 
However,  in  the  given  SAC  the  ordering  is  fixed  and  a  degree  of  freedom  has  been  lost.  Since  the 
original  representation  has  more  freedom  and  potential  parallelism  in  it,  it  would  be  desirable  to 
make it the starting representation and then systematically derive one or more SACs from it. It
turns  out  that  a  number  of  algorithms  can  be  written  in  the  form  of  (2)  (see  [41,  40,  39]),  and 
we  shall  refer  to  such  representations  as  Assignment  Codes  (ACs).  The  prefix  Single  has  been 
intentionally  dropped  to  emphasize  the  fact  that  in  such  representations  the  number  of  inputs  for 
computing  a  variable  may  depend  on  the  problem  size,  as  opposed  to  a  conventional  SAC  where  the 
number  of  inputs  to  every  variable  is  restricted  to  be  some  constant,  independent  of  the  problem 
size.  From  now  on  we  shall  refer  to  the  number  of  inputs  to  a  variable  as  its  in-degree  and  the 
number  of  variables  that  a  particular  variable  is  input  to  as  its  out-degree.  If  the  in-  or  out-degree 
of  a  variable  depends  on  the  problem  size,  then  we  shall  define  it  to  be  unbounded.  Thus,  the 
variables  in  ACs  can  have  unbounded  in-  and  out-degrees,  whereas  in  SACs  the  variables  have 
bounded  (i.e.,  constant)  in-degrees  but  may  have  unbounded  out-degrees. 
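The difference between the two representations of matrix multiplication can be illustrated with a short Python sketch. The function names, the 0-based indexing, and the initial value c(i, j, 0) = 0 are assumptions made for the illustration; integer data are used so that every summation order gives exactly the same result.

```python
def matmul_ac(a, b, order=None):
    """Assignment-code view: c[i][j] is a sum of n products, in any order.

    `order` is an optional permutation of range(n); the result does not
    depend on it because + is commutative and associative.
    """
    n = len(a)
    ks = list(range(n)) if order is None else list(order)
    return [[sum(a[i][k] * b[k][j] for k in ks) for j in range(n)]
            for i in range(n)]


def matmul_sac(a, b):
    """Single-assignment view: c(i,j,k+1) := c(i,j,k) + a_ik * b_kj.

    Every c(i,j,k) is written exactly once and has a constant number of
    inputs; the order of accumulation along k is now fixed.
    """
    n = len(a)
    c = {}
    for i in range(n):
        for j in range(n):
            c[i, j, 0] = 0  # assumed initial value
            for k in range(n):
                c[i, j, k + 1] = c[i, j, k] + a[i][k] * b[k][j]
    return [[c[i, j, n] for j in range(n)] for i in range(n)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
assert matmul_ac(a, b) == matmul_ac(a, b, order=[1, 0]) == matmul_sac(a, b)
```

In the AC, each c(i, j) has n inputs (an in-degree that grows with the problem size); in the SAC each c(i, j, k+1) has three inputs, while a given a(i, k) is still read by n different statements.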

We  shall  further  restrict  ourselves  to  linearly  indexed  ACs,  which  can  be  shown  to  be  very  close 
to  mathematical  expressions  for  a  number  of  problems,  especially  in  signal  processing  and  matrix 
algebra.  A  linearly  indexed  AC  has  statements  of  the  form 

$$x(PI + d)\ \text{depends on}\ y(QI + e) \quad \text{for all } I \in \mathbf{I} \subset \mathbb{Z}^S \qquad (4)$$

where P and Q are integral matrices independent of $I$, $\mathbf{I}$ is an index space, which is the
set of all lattice points enclosed within a specified region in an S-dimensional Euclidean space, and
d, e are constant displacement vectors. P and Q are often referred to as the indexing matrices. We have
shown in [41, 40, 39] that the in- and out-degrees of variables x and y are completely determined by the
structure and dimension of the right null-space of each of the indexing matrices. Many algorithms
are directly available in the form (4); examples include the formulas for matrix multiplication,
any m-dimensional convolution/correlation, matrix transposition, and the solution of the matrix
Lyapunov equation. Algorithms that are not directly in the form of (4) can often be easily put in that
form by analyzing their sequential representations (see [41]).

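Example 2: The formula for matrix multiplication is:

For all tuples $(i, j)$, $1 \le i, j \le n$ do

$$c_{ij} := \sum_{1 \le k \le n} a_{ik} \cdot b_{kj}$$

The index space of the example is $\mathbf{I} = \{(i, j, k) \mid 1 \le i, j, k \le n\}$. There is one
functional relation in the given AC with the dependence matrices

$$
P_c = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad
Q_{ac} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
Q_{bc} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
$$

and the displacement vectors d, e = 0.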
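The link between the indexing matrices and the in- and out-degrees can be checked numerically on Example 2. The sketch below is only an illustration of that statement, not the procedure of [41, 40, 39]; the helper names `apply` and `fan` are hypothetical. It verifies one null vector per matrix and counts how many index points share the same variable instance.

```python
from collections import Counter
from itertools import product

P_C = ((1, 0, 0), (0, 1, 0))   # c is indexed by (i, j)
Q_AC = ((1, 0, 0), (0, 0, 1))  # a is indexed by (i, k)
Q_BC = ((0, 1, 0), (0, 0, 1))  # rows as listed in Example 2


def apply(matrix, point):
    """Integer matrix-vector product, returned as a tuple."""
    return tuple(sum(m * p for m, p in zip(row, point)) for row in matrix)


def fan(matrix, n):
    """Distinct counts of index points in {1..n}^3 sharing one image."""
    points = product(range(1, n + 1), repeat=3)
    counts = Counter(apply(matrix, point) for point in points)
    return set(counts.values())


# Each matrix has rank 2, so its right null space is one-dimensional.
assert apply(P_C, (0, 0, 1)) == (0, 0)
assert apply(Q_AC, (0, 1, 0)) == (0, 0)
assert apply(Q_BC, (1, 0, 0)) == (0, 0)

n = 4
print(fan(P_C, n), fan(Q_AC, n), fan(Q_BC, n))
```

For every matrix, n index points map to the same variable instance: each c(i, j) collects n products (in-degree n), and each entry of a and b is used n times (out-degree n). Both degrees therefore grow with the problem size, which is what "unbounded" means in the text. Permuting the rows of an indexing matrix does not change its null space, so the counts do not depend on the order in which the two indices of b are listed.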

We have shown that a linearly indexed AC can be systematically decomposed into a linearly indexed
SAC and a linearly indexed dual SAC. Linearly indexed dual SACs may have variables with unbounded
in-degrees but bounded out-degrees, as opposed to linearly indexed SACs, which have
variables with bounded in-degrees but possibly unbounded out-degrees. Formal procedures have
also been developed for converting linearly indexed SACs and linearly indexed dual SACs into localized
algorithms. The conversion of linearly indexed SACs to localized algorithms involves eliminating
global dependencies by propagating variables in a systematic manner in the index space. The
conversion of dual SACs to localized algorithms is achieved by distributing computations and
introducing an ordering among the computations. The two conversion procedures turn out to be
duals of each other. We should mention here that starting with linearly indexed ACs is by no
means essential in our approach; if one cannot find an AC easily, then one can try to use other
well-known techniques and start the procedure with a linearly indexed SAC.
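To make the idea of propagating variables concrete, here is a hedged sketch of one familiar localized form of the matrix-product SAC. It is a textbook-style illustration rather than the output of the formal procedure; the propagation directions (a along j, b along i, partial sums along k) and the 0-based indexing are assumptions.

```python
def matmul_localized(a, b):
    """Matrix product in which every operand comes from a unit-distance
    neighbour in the (i, j, k) index space.

    a_ik is propagated along j, b_kj is propagated along i, and the partial
    sum c is accumulated along k, so no value is broadcast.
    """
    n = len(a)
    A, B, C = {}, {}, {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Boundary points read the inputs; interior points copy
                # the value from their neighbour.
                A[i, j, k] = a[i][k] if j == 0 else A[i, j - 1, k]
                B[i, j, k] = b[k][j] if i == 0 else B[i - 1, j, k]
                previous = 0 if k == 0 else C[i, j, k - 1]
                C[i, j, k] = previous + A[i, j, k] * B[i, j, k]
    return [[C[i, j, n - 1] for j in range(n)] for i in range(n)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_localized(a, b))
```

Each variable now has constant in-degree and constant out-degree, at the cost of the extra propagation variables A and B.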

In summary, we have developed a hierarchical procedure for going from a higher-level
representation of an algorithm to an RIA or a localized algorithm. It can be described as follows:

Mathematical Description -> Linearly Indexed Assignment Codes -> Linearly Indexed
Single Assignment Codes and Dual Single Assignment Codes -> Localized Algorithms.

The conversion procedure is by no means unique, and a number of localized algorithms can be
generated starting from the same AC. To enable an efficient choice we have also developed procedures
to  directly  schedule  and  analyze  linearly  indexed  codes  of  the  form  (4).  For  example,  we  have 
developed  necessary  and  sufficient  conditions  for  determining  whether  a  sequence  of  SACs  of  the 
form  (4)  can  be  scheduled  using  affine  schedules  (see  [41,  16,  42]).  Procedures  to  schedule  linearly 
indexed  codes  that  do  not  admit  affine  schedules  are  also  discussed  in  [41]. 
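The necessary and sufficient conditions of [41, 16, 42] are not reproduced here. As a much simpler illustration of what a candidate schedule must satisfy, the sketch below checks the standard causality condition for a purely linear schedule t(I) = λ·I with unit-time operations: every dependence displacement d must satisfy λ·d ≥ 1. The function name and the dependence set (unit displacements along i, j and k, as in a localized matrix product) are assumptions for the example.

```python
def is_valid_linear_schedule(lam, displacements):
    """True if lam . d >= 1 for every dependence displacement d.

    A value computed at index point I - d is then available at least one
    time step before it is consumed at index point I.
    """
    return all(
        sum(l * x for l, x in zip(lam, d)) >= 1
        for d in displacements
    )


# Assumed displacements: one operand propagated along j, one along i,
# and partial sums accumulated along k.
DEPENDENCES = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]

print(is_valid_linear_schedule((1, 1, 1), DEPENDENCES))
print(is_valid_linear_schedule((1, 1, 0), DEPENDENCES))
```

The first candidate respects all three dependences; the second assigns the same time step to both ends of the accumulation along k and is rejected.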

3  Summary  of  Our  Work  on  Parallel  Architectures  for  Viterbi 
Decoders 

In  this  section  we  shall  summarize  our  work  on  the  development  of  parallel  architectures  for  the 
optimal  decoding  of  convolutional  and  trellis  codes  using  the  Viterbi  Algorithm. 

The Viterbi Algorithm (VA) [48], [49], [9], which is widely used in communication systems using
convolutional and trellis codes, is essentially an algorithm for finding a minimum-distance path in a
so-called trellis diagram. There have been several implementations of the VA, ranging from totally
sequential to fully parallel multiprocessor implementations based on the state transition diagrams
of the encoders. By restricting ourselves to codes generated by linear systems over GF(q), we are
able to present some apparently novel parallel architectures for decoding rate k/n linear codes,
and asymptotically area-efficient implementations for the Thompson VLSI grid model. For k = 1,
an equivalence mapping between de Bruijn graphs and shuffle-exchange networks is used to relate
previously proposed architectures for decoding rate 1/n codes to the straightforward architectures
based on the state transition graphs of their encoders. Although these architectures result in fast
implementations, they require global communication. Hence, we have also studied implementations
that have only local communications and require less silicon area.

In  the  rest  of  this  summary  we  shall  briefly  describe  our  work  and  relate  it  to  the  work  done  by 
other  researchers. 

3.1  The  Viterbi  Algorithm  and  its  Parallel  Implementations 

The basic theory behind the VA is readily available in the literature; a good survey was provided
by Forney [9], who introduced the concept of the trellis exposition of the decoding algorithm.
The trellis diagram is a graphical representation of the state diagram of the encoder drawn as a
function of discrete time. Fig. 3 shows the state transition and trellis diagrams for the binary rate
1/2 convolutional encoder shown in Fig. 2. Each time step corresponds to a single symbol interval
T (defined as the time interval between two consecutive output symbols of the receiving channels),
and defines one stage of the trellis. Each node at every stage of the trellis diagram represents one
possible state of the encoder. If q is the number of symbols in the alphabet and v is the number of
memory elements, then each stage of the trellis has $q^v$ nodes. There is an edge between the node
$S_i^t$ (i.e., the node representing state $S_i$ at stage t) and the node $S_j^{t+1}$ if and only if
there is a directed edge from state $S_i$ to $S_j$ in the state transition diagram of the encoder. If
there are k inputs to the encoder circuit then each state has $q^k$ predecessors; equivalently, the
in-degree of every node in the trellis diagram is $q^k$.

Each edge (i.e., $S_i^t \to S_j^{t+1}$) is assigned a weight $b_{ij}^t$, called the branch metric.
The VA is a dynamic programming algorithm for finding a minimum-distance path in the weighted trellis
diagram [28] and can be described as follows. Each node of the trellis diagram has a path metric and a
survivor sequence associated with it. The path metric $P_j^t$ of the node $S_j^t$ is the weighted
length of a minimum-distance path between the starting node $S^0$ and the node $S_j^t$ in the trellis
diagram; the survivor sequence for state $S_j$ at time t is the state sequence associated with this
minimum-distance path. Once every symbol interval, the path metrics are updated as follows:

$$P_j^{t+1} = \min_i \left( P_i^t + b_{ij}^t \right) \quad \forall\, i \ \text{such that}\ S_i \to S_j \qquad (5)$$

where $S_i \to S_j$ implies that there is a valid state transition from state $S_i$ to $S_j$. The
old survivor sequence of the winning ancestor is augmented with the symbol corresponding to the
transition to state $S_j$ to form the new survivor sequence for the state $S_j$. After a sufficiently
long time L (see, e.g., [49]), the survivor sequence of the state with the minimum path metric is
chosen to be the estimate for the state sequence of the encoder; the decoding procedure is then
completed by determining the input sequence corresponding to the estimated state sequence. [In
actual implementations, many practical issues must be accommodated, such as truncation of the
survivor sequence, extraction of the estimated source symbols, and path metric overflow control in
finite-word-length registers.]
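The add-compare-select update (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a decoder design: the encoder is a binary rate 1/2 feedforward encoder with two memory elements and generators (7, 5) in octal, chosen for the example and not necessarily the encoder of Fig. 2; branch metrics are Hamming distances; survivors store input symbols directly instead of state sequences; and none of the practical issues listed above (truncation, overflow control) are handled.

```python
def encoder_step(state, u):
    """One step of the assumed (7, 5) encoder; state = 2*u[t-1] + u[t-2]."""
    u1, u2 = state >> 1, state & 1
    return (u << 1) | u1, (u ^ u1 ^ u2, u ^ u2)


N_STATES = 4
PRED = {j: [] for j in range(N_STATES)}  # PRED[j]: list of (i, input symbol)
OUT = {}                                 # OUT[i, j]: output on edge S_i -> S_j
for s in range(N_STATES):
    for u in (0, 1):
        nxt, out = encoder_step(s, u)
        PRED[nxt].append((s, u))
        OUT[s, nxt] = out


def encode(bits):
    state, symbols = 0, []
    for u in bits:
        state, out = encoder_step(state, u)
        symbols.append(out)
    return symbols


def viterbi(n_stages, n_states, predecessors, branch_metric):
    """Minimum-distance path through the trellis, starting in state 0."""
    inf = float("inf")
    metric = [0.0] + [inf] * (n_states - 1)
    survivor = [[] for _ in range(n_states)]
    for t in range(n_stages):
        new_metric, new_survivor = [], []
        for j in range(n_states):
            # Add-compare-select over all predecessors of S_j, as in (5).
            best, i_best, symbol = min(
                (metric[i] + branch_metric(t, i, j), i, sym)
                for i, sym in predecessors[j]
            )
            new_metric.append(best)
            new_survivor.append(survivor[i_best] + [symbol])
        metric, survivor = new_metric, new_survivor
    winner = min(range(n_states), key=metric.__getitem__)
    return survivor[winner], metric[winner]


def decode(received):
    def hamming(t, i, j):
        return sum(r != o for r, o in zip(received[t], OUT[i, j]))
    return viterbi(len(received), N_STATES, PRED, hamming)


message = [1, 0, 1, 1, 0, 0]
decoded, distance = decode(encode(message))
assert decoded == message and distance == 0
```

The inner loop over j is the part that the fully parallel architecture distributes over one processor per state.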

Thus, at each node of the trellis diagram, the VA has to perform $q^k$ additions and comparisons
(one for each predecessor) to update the path metrics and survivor sequences; hence, the
total number of operations to be executed during a symbol interval is $O(q^{k+v})$. A sequential
implementation would require $O(q^{k+v})$ addition and comparison operations and $q^v$ random accesses
to the processor’s memory during each symbol interval T. Hence, the throughput rate of such
an implementation decreases exponentially with increase in the constraint length and may not be
acceptable in applications where high data rates are required.

One can, however, trade time for hardware and implement the VA in parallel with increased
hardware complexity but reduced T (hence, higher data rates). An intuitively obvious architecture
can be obtained by projecting the trellis diagram along the time direction and implementing the
VA on the resulting architecture, which will consist of a set of $q^v$ processors connected according
to the state transition diagram of the encoder. We shall refer to this architecture as the fully
parallel architecture for implementing the VA. At every time step, each processor performs the
operations represented by (5), i.e., it receives the output symbol and generates the branch metrics
for each of its predecessors; then it has to perform $q^k$ additions and select the minimum among
them, followed by updating of the survivor sequence. Hence, for a rate k/n encoder the symbol
interval T is $O(q^k)$ and a gain of $O(q^v)$ over a sequential processor has been achieved.

3.1.1  Previous  Work 

Kriete and Cain [1] used a fully parallel architecture (along with several tricks in order to keep the
path metrics small) to implement the VA for a rate 1/2 feedforward encoder with constraint length
v = 7 in VLSI (Very Large Scale Integrated) circuit technology. However, they did not address
issues  such  as  area-efficient  implementations  of  Viterbi  decoders  for  arbitrary  constraint  length, 
or  alternate  parallel  architectures  that  may  be  simpler  than  the  architectures  based  on  the  state 
diagrams.  Lower  silicon  area  enables  the  designer  to  put  more  processors  on  a  single  chip  and  yet 
have  a  high  yield;  it  also  may  lead  to  higher  speed  by  reducing  the  length  of  interconnecting  wires. 
And  of  course,  if  the  decoder  architecture  is  large  compared  to  the  available  resources,  then  one 
requires  efficient  strategies  for  executing  the  larger  sized  problem  on  the  available  resources. 

Gulak and Shwedyk [12], [11] showed that by carefully analyzing the trellis diagram one can
map the VA for the special case of rate 1/n FIR encoders onto well-known architectures
for parallel processing called shuffle-exchange networks [44], [46]. Optimum VLSI layouts are
already known [19], [22], [45], [13] for such networks and hence the same layouts can be used
for Viterbi decoders. Moreover, it is well known that shuffle-exchange networks are functionally
equivalent to a whole family of other popular networks such as hypercubes, cube-connected cycles,
butterflies, omega networks, etc. [30], [46]; thus the VA for rate 1/n FIR encoders can be efficiently
implemented on any of these architectures. However, among these architectures, the shuffle-exchange
networks have the least VLSI area for the same number of nodes. We should also comment that,
with hindsight, the relationship between trellis diagrams of rate 1/n FIR encoders and
shuffle-exchange networks should come as no surprise. As early as 1973, Forney [9] and then Rader [31]
had noted the equivalence between the trellis diagrams of rate 1/n feedforward encoders and the
dependence graphs of FFT algorithms, and the fact that FFTs can be efficiently implemented on
shuffle-exchange networks had been well known even before that [44]. However, in our work we
have presented a much more direct connection between the trellis diagram and the shuffle-exchange
networks by pointing out a simple procedure for mapping the state transition graphs (which are
known as de Bruijn graphs or Good’s diagrams) to the shuffle-exchange graphs.

[Footnote: A quantity $f(N)$ is said to be $O(g(N))$ if there exists $k > 0$ such that
$f(N) \le k\,g(N)$ for sufficiently large N. It is $\Omega(g(N))$ if for some $k_1 > 0$,
$f(N) \ge k_1\,g(N)$ for sufficiently large N. Finally, $f(N)$ is $\Theta(g(N))$ if it is both
$O(g(N))$ and $\Omega(g(N))$.]

A different family of parallel implementations was presented by Chang and Yao [3]. They
interpreted the VA (both for rate 1/n and rate k/n feedforward convolutional encoders) as a
sequence of matrix-vector multiplications (where the usual + operation is replaced by the min
operation and the usual multiplication operation is replaced by addition) and then implemented the
VA using systolic architectures already developed in the literature for matrix-vector multiplication.
The implementation uses $O(q^v)$ processors; however, the symbol interval T is at best $O(q^v / v)$
and thus the gain in speed over that of the sequential processor is at most $qv$. Hence, with an
exponential number of processors, the gain achieved in throughput rate is at best logarithmic.
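The matrix-vector interpretation can be illustrated with a small min-plus product in Python. The function name and the toy two-state branch-metric matrix are assumptions for the example; infinity marks a transition that does not exist in the trellis.

```python
INF = float("inf")


def min_plus_matvec(branch, path):
    """Min-plus product: result[j] = min over i of (branch[j][i] + path[i]).

    branch[j][i] is the branch metric of the transition S_i -> S_j, or INF
    if there is no such transition; path[i] is the current path metric.
    """
    return [
        min(b_ji + p_i for b_ji, p_i in zip(row, path))
        for row in branch
    ]


# Toy stage with two states; S_0 -> S_1 is not a valid transition.
branch_metrics = [[1, 3],
                  [INF, 0]]
path_metrics = [2, 5]
print(min_plus_matvec(branch_metrics, path_metrics))
```

One such product per symbol interval reproduces the path-metric update of the VA, so any array that computes ordinary matrix-vector products can be reused once its multiply and add cells are replaced by add and min cells.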

3.1.2  Our  Contributions 

Let us first consider the fully parallel architectures, where several questions remain unanswered.
In particular, rate k/n convolutional codes (k > 1) have better distance properties and
lower error probabilities than rate 1/n codes [49] and are widely used in practice. However, fully
parallel architectures for decoding rate k/n codes based on the state transition diagrams become
complicated, and no alternative simpler architectures or area-efficient VLSI implementations seem
to have been described in the literature.

In  our  work  we  observed  that  the  state  transition  diagram  of  any  rate  1/n  feedforward  encoder, 
which  is  a  de  Bruijn  graph  (also  referred  to  as  a  Good’s  diagram),  can  be  directly  mapped  to  a 
shuffle-exchange  network.  The  simple  mapping  technique  has  been  independently  discovered  by 
several researchers [23], [29], [38], and can be used to show that optimal VLSI layouts for de Bruijn
graphs can be obtained by suitably modifying the optimal layouts of shuffle-exchange networks.
The resulting layouts of N-node de Bruijn graphs have area $O(N^2/\log^2 N)$ and are only a
constant factor larger than the layouts of the corresponding shuffle-exchange networks. There
may  be,  however,  advantages  in  implementing  de  Bruijn  networks  because  unlike  shuffle-exchange 
networks,  they  are  fault  tolerant  and  work  efficiently  in  the  presence  of  a  single  faulty  node  or  link 
[43].  We  also  show  that  the  state  diagrams  of  rate  1/n  encoders  with  feedback  when  realized  in  a 
certain  canonical  form  still  have  the  structure  of  de  Bruijn  graphs;  hence,  the  decoder  architectures 
for  feedforward  encoders  can  be  used  to  decode  codes  generated  by  encoders  with  feedback. 
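
One standard way to see the relation between the two graphs in the binary case is sketched below in Python. It is offered only as an illustration and is not necessarily the exact mapping procedure referred to above; the function names and the bit-level labeling of the nodes are assumptions.

```python
def de_bruijn_successors(x, n):
    """Successors of node x in the binary de Bruijn graph on 2**n nodes:
    shift the n-bit label left by one and append a new bit b."""
    mask = (1 << n) - 1
    return {((x << 1) & mask) | b for b in (0, 1)}


def shuffle(x, n):
    """Shuffle edge: cyclic left rotation of the n-bit label."""
    mask = (1 << n) - 1
    return ((x << 1) & mask) | (x >> (n - 1))


def exchange(x):
    """Exchange edge: flip the least significant bit of the label."""
    return x ^ 1


def check_mapping(n):
    """True if every de Bruijn successor of every node is reached by a
    shuffle, optionally followed by an exchange."""
    for x in range(1 << n):
        s = shuffle(x, n)
        if de_bruijn_successors(x, n) != {s, exchange(s)}:
            return False
    return True


print(check_mapping(2), check_mapping(5))
```

Under this labeling, each de Bruijn transition costs at most two steps in the shuffle-exchange graph.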

For  rate  k/n  feedforward  encoders  realized  in  an  ‘obvious’  manner,  we  have  shown  that  the  state 
diagrams  can  be  represented  as  Cartesian  products  of  k,  possibly  distinct,  de  Bruijn  graphs.  The 
resulting product graph representation is much simpler than the original state transition diagram,
and architectures based on the representation do not suffer any loss in performance. Minimum
area  VLSI  layouts  for  the  product  graphs  are  presented  using  a  recursive  layout  technique  that 
uses  the  optimal  layout  strategy  for  de  Bruijn  networks.  Also,  we  prove  that  the  optimal  layouts 
of  the  product  graphs  save  at  least  a  factor  of  0{q^)  in  silicon  area  compared  to  the  direct  layouts 
of  the  state  transition  diagrams.  It  is  also  shown  that  under  certain  conditions  one  may  choose  to 
implement  syndrome  decoding  based  on  the  state  diagram  of  the  dual  encoder.  The  dual  encoders 
are  always  feedforward  [7],  [8]  and  may  have  simpler  state  diagrams  (if  k  >  n/2)  than  the  original 
encoder. 

For  the  general  case  of  rate  k/n  encoders  with  feedback,  one  can  use  a  result  of  Forney  [7]  that 
every  rate  k/n  convolutional  code  can  be  regarded  as  having  been  generated  by  a  minimal  encoder, 
which  is  necessarily  feedforward  and  has  minimum  constraint  length.  Hence,  any  rate  k/n  code 
can  be  decoded  by  a  decoder  based  on  the  state  transition  diagram  of  a  corresponding  feedforward 
minimal  encoder,  with  the  resulting  codeword  converted  to  the  encoder  input  by  a  trivial  linear 
operation. Thus, in general, the VA for any rate k/n encoder can always be implemented in parallel
on  a  set  of  processors  connected  according  to  a  Cartesian  product  of  de  Bruijn  graphs. 
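
The product structure can be sketched in Python as follows. The sketch assumes binary inputs (q = 2) and an encoder state stored as k separate shift registers, with register j holding `lengths[j]` bits; the function names and the example register lengths are hypothetical. Each coordinate of the state then moves along an edge of its own de Bruijn graph, independently of the other coordinates.

```python
from itertools import product


def de_bruijn_next(x, nbits, b):
    """Shift the nbits-wide register x left by one and append input bit b."""
    mask = (1 << nbits) - 1
    return ((x << 1) & mask) | b


def product_successors(state, lengths):
    """All successors of a k-register state.

    state[j] is the content of register j, lengths[j] its width in bits.
    One input bit per register is consumed in each symbol interval.
    """
    successors = set()
    for bits in product((0, 1), repeat=len(lengths)):
        successors.add(tuple(de_bruijn_next(x, n, b)
                             for x, n, b in zip(state, lengths, bits)))
    return successors


# Hypothetical k = 2 example: one 1-bit register and one 2-bit register.
succ = product_successors((0, 1), (1, 2))
print(sorted(succ))
```

Each state has 2^k successors, and projecting any transition onto register j gives an edge of the de Bruijn graph for that register.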

The fully parallel architectures discussed above are fast but require global communication and
large silicon area. Hence, with Dr. P. G. Gulak, a postdoctoral scholar and later a research associate,
we have also studied implementations that have only local communication and require less silicon
area. The study of such implementations has led us to the design of so-called cascade architectures
that can be shown to have the best (area) × (pipeline-period) product. A paper on this and related
architectures has been published: P. G. Gulak and Thomas Kailath, “Locally Connected VLSI
Architectures for the Viterbi Algorithm,” IEEE Journal on Selected Areas in Communications, vol. 6,
pp. 527-537, April 1988.

3.2  VLSI  Implementation

In the past year, Dr. P. G. Gulak has put considerable effort into using the theoretical
insights generated by our work to obtain VLSI implementations of Viterbi decoders. The project
is a rather long one, and Dr. Gulak has been continuing it at the University of Toronto, Canada, where
he is an Assistant Professor. An experimental VLSI implementation of a soft-decision, binary, rate
1/2, constraint length three convolutional decoder in 3 μm CMOS technology has been carried
out. It has been submitted to Bell Northern Research (BNR) for fabrication, and a set of chips is
expected to be tested by the end of this year. A brief description of the chip is presented below.

The architecture is based on a number of simple processors, each performing an Add-Compare-Select
(ACS) operation, and connected according to the state transition graph of the encoder,
which is a de Bruijn graph for rate 1/n codes. For our experimental design, the decoder has four
processors (each processor corresponds to one state of the rate 1/2, constraint length three encoder)
as shown in Fig. 4. Two entries, namely a path (or state) metric and a survivor sequence, are stored
at each node. The survivor sequence represents the earlier symbols that led to the presumed
present state through valid state transitions. The state metric is the weight associated with the
survivor sequence. Once each symbol interval, state metrics are updated by adding a measure of
unlikelihood, called a branch metric, defined by valid state transitions from each possible state into
a given present state. The lowest resulting sum defines the new state metric for the given present
state. The survivor sequence of the winning ancestor is extended with the symbol corresponding
to the transition to this state, to form the new survivor path for this state.
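
The ACS update just described can be sketched in Python for a four-state decoder. This is a minimal software model, not a description of the chip: the state labeling (the new input bit is shifted into the low end of the state), the function name `acs_step`, and the numeric branch metrics are assumptions made for illustration.

```python
def acs_step(metrics, survivors, branch):
    """One symbol interval of add-compare-select on a de Bruijn trellis.

    metrics[s]      current path metric of state s
    survivors[s]    list of decided bits that led to state s
    branch[(p, s)]  branch metric of the transition p -> s
    """
    n_states = len(metrics)
    new_metrics, new_survivors = [], []
    for s in range(n_states):
        # Assumed labeling: state s is entered by shifting in bit s & 1,
        # so its two predecessors are s >> 1 with either value of the top bit.
        preds = (s >> 1, (s >> 1) | (n_states >> 1))
        # Add both candidate sums, compare them, select the smaller one.
        best_sum, best_pred = min((metrics[p] + branch[(p, s)], p) for p in preds)
        new_metrics.append(best_sum)
        # Extend the winning ancestor's survivor with the transition symbol.
        new_survivors.append(survivors[best_pred] + [s & 1])
    return new_metrics, new_survivors


# Hypothetical metrics for one symbol interval.
metrics = [0, 3, 1, 4]
survivors = [[], [], [], []]
branch = {(0, 0): 0, (2, 0): 2, (0, 1): 2, (2, 1): 0,
          (1, 2): 1, (3, 2): 1, (1, 3): 1, (3, 3): 1}
new_metrics, new_survivors = acs_step(metrics, survivors, branch)
print(new_metrics, new_survivors)
```

On a tie the sketch keeps the lower-numbered predecessor; the text does not state a tie-breaking rule, so this choice is arbitrary.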

During a preliminary layout to find the approximate size and shape of the various cells in the
3 μm CMOS technology, a four-processor Viterbi decoder was determined to require a 4 mm square die. Various
other  design  parameters  were  decided  through  Monte-Carlo  simulations.  For  example,  simulations 
demonstrated  that  for  relatively  high  signal-to-noise  ratios,  a  6-bit  survivor  sequence  allowed  the 
sequences  to  merge  before  the  delayed  estimates  were  generated.  A  soft  decision  scheme  was 
adopted  and  again  the  simulations  showed  that  it  was  sufficient  to  maintain  state  metrics  to  4-bit 
resolution. It was also decided to generate the branch metrics off-chip and use eight four-pin ports to
bring them on-chip. Although this meant that 32 pins of a 40-pin package would be used, it allows
for the flexibility of using an external look-up ROM, so that changes in the branch metric are facilitated.
Besides  the  branch  metrics,  four  additional  connections  are  made  to  the  chip,  namely,  one  each  for 
power  and  ground,  one  for  the  system  clock  and  one  for  the  estimated  output  sequence  for  a  total 
of  36  pins.  The  VLSI  layout  is  presented  in  Fig.  5. 

The  operation  of  the  chip  is  as  follows.  As  shown  in  Fig.  4,  there  are  two  transitions  into  each 
state  or  processor.  Thus  each  processor  receives  the  two  corresponding  state  metrics  from  the  buses 
that cross the center of the chip (see Fig. 5). Meanwhile, the off-chip hardware uses observations of
the received symbol to look up a ROM table and generate branch metrics for both of the transitions
into that state. These four-bit values arrive via buses from outside. The state metric and the
branch metric are added together in separate 4 by 6-bit ripple-carry adders for each of the two
input paths. The two sums are compared by a 6-bit comparator and a select signal is generated,
indicating which sum is lower. While these operations are in progress, the survivor sequences are
delivered to the appropriate processing elements. The 6-bit survivor sequence selected by the above
comparison is saved in a D-type latch on the rising edge of the clock. The select line then gets the
proper state metric into the normalizer. Only if all four of the processors select state metrics with
values greater than 15 will each of the processors normalize its metric by subtracting 16 from it.
This  prevents  wrap-around  overflow  while  involving  only  the  upper  two  bits.  On  the  falling  edge 
of  the  clock,  the  6-bit  state  metric  is  latched,  thus  replacing  the  information  from  the  previous 
symbol  interval.  The  oldest  bit  of  the  survivor  sequence  is  the  estimate  of  the  received  symbol.  All 
other  bits  in  the  survivor  sequence  register  are  shifted  over  by  one  position  and  the  newest  bit  is 
appended  to  indicate  the  next  trellis  decision  in  the  path  history.  The  procedure  then  repeats  with 
the  next  received  symbol. 
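
The normalization rule and the survivor register update can be modeled with a short Python sketch. The function names are hypothetical, and storing the oldest decision in the most significant bit of the 6-bit register is an assumption; the text does not specify the bit ordering used on the chip.

```python
SURVIVOR_BITS = 6  # survivor sequence length used in the experimental design


def normalize(metrics):
    """Subtract 16 from every state metric, but only if all of them exceed 15."""
    if all(m > 15 for m in metrics):
        return [m - 16 for m in metrics]
    return list(metrics)


def shift_survivor(register, new_bit):
    """Shift the survivor register by one position.

    Returns (oldest_bit, new_register). The oldest bit is the decoder's
    delayed estimate; new_bit records the latest trellis decision.
    Assumption: the oldest decision sits in the most significant bit.
    """
    mask = (1 << SURVIVOR_BITS) - 1
    oldest = (register >> (SURVIVOR_BITS - 1)) & 1
    return oldest, ((register << 1) & mask) | new_bit


print(normalize([16, 20, 31, 17]))   # all metrics exceed 15
print(normalize([16, 20, 15, 17]))   # one metric does not
print(shift_survivor(0b100101, 1))
```

Because the threshold is 16, the test and the subtraction depend only on the upper two bits of each 6-bit metric, which matches the remark in the text.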